
TABLE 5.1
Performance of the proposed quantization method on the WMT14 EN-DE and WMT14 EN-FR test sets.

Model  Method              Precision  EN-DE                              EN-FR
                                      PPL    BLEU   Size (Gb)  Compr.    PPL    BLEU   Size (Gb)  Compr.
Base   Baseline            32-bit     4.95   26.46  2.02       1x        3.21   38.34  1.94       1x
       Default Approach    8-bit      74.04  0.21   0.52       3.91x     nan    0      0.50       3.91x
       Post-Quantization   8-bit      4.97   26.44  0.52       3.91x     3.26   38.30  0.50       3.91x
       FullyQT             8-bit      4.94   26.38  0.52       3.91x     3.23   38.41  0.50       3.91x
       Post-Quantization   6-bit      6.00   24.84  0.39       5.18x     3.98   35.02  0.37       5.17x
       FullyQT             6-bit      5.09   26.98  0.39       5.18x     3.38   37.07  0.37       5.17x
       FullyQT             4-bit      11.96  18.32  0.26       7.66x     48.21  1.59   0.25       7.64x
Big    Baseline            32-bit     4.38   27.13  6.85       1x        2.77   40.54  6.69       1x
       Post-Quantization   8-bit      4.27   26.55  1.74       3.95x     2.78   39.78  1.69       3.95x
       FullyQT             8-bit      4.57   26.96  1.74       3.95x     2.80   40.25  1.69       3.95x
       Post-Quantization   6-bit      5.12   24.86  1.31       5.24x     3.08   37.92  1.28       5.24x
       FullyQT             6-bit      4.78   26.76  1.31       5.24x     2.87   39.59  1.28       5.24x
       FullyQT             4-bit      33.11  10.22  0.88       7.79x     42.42  2.81   0.86       7.79x

for all weight matrices. For activations, they use tensor bucketing for the following tensors: the sum of the input embeddings with the positional encoding; the Q, K, and V inputs; the scaled dot-product attention's output; the feed-forward's output; and the LayerNorm's numerator, quotient, and output.
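
Each of these activations keeps its own set of quantization parameters (its own "bucket"). As a rough illustration of what per-tensor bucketing looks like, the sketch below applies generic asymmetric uniform quantization with min/max calibration per bucket; the helper quantize_bucket, the tensor names, and the calibration choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def quantize_bucket(x, num_bits=8):
    """Asymmetric uniform quantization of one tensor "bucket".

    Each bucketed tensor keeps its own (x_min, x_max) pair; here the
    range is simply calibrated from the tensor itself (an assumption,
    not necessarily the calibration scheme used in the paper).
    """
    x_min, x_max = float(np.min(x)), float(np.max(x))
    scale = (x_max - x_min) / (2 ** num_bits - 1)
    q = np.clip(np.round((x - x_min) / scale), 0, 2 ** num_bits - 1)
    return q * scale + x_min  # dequantized ("fake-quantized") tensor

# Hypothetical activations from one layer, each with its own bucket:
buckets = {
    "embed_plus_pos": np.random.randn(4, 16),         # embeddings + positional encoding
    "attention_out":  np.random.randn(4, 16) * 0.05,  # scaled dot-product attention output
    "ffn_out":        np.random.randn(4, 16) * 3.0,   # feed-forward output
}
fake_quantized = {name: quantize_bucket(t) for name, t in buckets.items()}
```

Because each bucket is scaled independently, a tensor with a small dynamic range (such as the attention output above) is not forced onto the coarse grid of a tensor with a large one, which is the motivation for bucketing.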

5.2.4 Dealing with Zeros

Unlike the classic quantization method proposed in [104], they do not nudge the domain so that the zero value is perfectly mapped. Specifically, the only zero values are the padding, the Softmax's numerator and output, the output of the ReLU layers, and values zeroed by dropout. Since padding does not affect the final output, these values are ignored when quantizing. For the quantization parameters, xmin of the ReLUs and of the Softmax's numerator and output is fixed to 0, which guarantees that zero is mapped exactly. Finally, quantization is applied before any dropout operation.
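
Fixing xmin to 0 works because the quantization grid then contains the real value 0 exactly, so no nudging of the range is required. The sketch below contrasts the two situations with a generic 8-bit uniform quantizer (fake_quant is a hypothetical helper, not code from the paper):

```python
import numpy as np

def fake_quant(x, x_min, x_max, num_bits=8):
    # Generic uniform quantization over [x_min, x_max], then dequantization.
    scale = (x_max - x_min) / (2 ** num_bits - 1)
    q = np.clip(np.round((x - x_min) / scale), 0, 2 ** num_bits - 1)
    return q * scale + x_min

relu_out = np.array([0.0, 0.3, 1.7, 0.0, 2.5])  # non-negative by construction

# x_min fixed to 0: zero sits exactly on the quantization grid,
# so the zeros produced by ReLU (or the Softmax) survive quantization.
print(fake_quant(relu_out, x_min=0.0, x_max=2.5))

# A range whose grid does not contain 0 exactly (e.g. [-0.1, 2.5])
# maps 0 to roughly 0.002; classic schemes such as [104] avoid this
# by nudging the range so that the zero-point falls on an integer.
print(fake_quant(relu_out, x_min=-0.1, x_max=2.5))
```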

Table 5.1 shows the performance of the proposed method on the WMT14 EN-DE and WMT14 EN-FR test sets. They compare results with two full-precision Transformers: the base and big variants. Two other quantization approaches are also evaluated. The first is the "default" approach, which naively quantizes every possible operation. The second applies the proposed quantization strategy post-training. In all cases except post-training quantization, BLEU was computed on the test set using the checkpoint that scored the highest accuracy on the validation set; towards the end of training, one validation epoch was run every 100 training steps. The baseline and FullyQT 8-bit results were averaged over 5 trials. The standard deviation of the BLEU scores did not appear higher for any method and ranged between 0.09 and 0.51. Training with quantization was about twice as slow as training the baselines. For post-training quantization, the BLEU score was computed on the test set using the best validation performance out of 20 trials. The default approach's nan on the EN-FR task is due to numerical instability: when every operation is quantized, zeros in the LayerNorm's denominator become more frequent.
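
The instability can be reproduced with a toy example: the LayerNorm denominator sqrt(variance + eps) can be orders of magnitude smaller than the largest value in the same tensor, so on an 8-bit uniform grid it may round to exactly zero and the following division diverges. A minimal numerical sketch under these assumptions (not the authors' code):

```python
import numpy as np

def fake_quant(x, x_min, x_max, num_bits=8):
    # Generic uniform quantization over [x_min, x_max], then dequantization.
    scale = (x_max - x_min) / (2 ** num_bits - 1)
    q = np.clip(np.round((x - x_min) / scale), 0, 2 ** num_bits - 1)
    return q * scale + x_min

# Per-position LayerNorm denominators sqrt(var + eps): one position with
# nearly constant features yields a value far below one quantization step.
denom = np.array([1.2, 0.9, 2.5, 1e-3])

q_denom = fake_quant(denom, x_min=0.0, x_max=float(denom.max()))
print(q_denom)        # the 1e-3 entry rounds to exactly 0.0
print(1.0 / q_denom)  # division by the quantized zero gives inf, which
                      # later operations can turn into nan
```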

In summary, this paper's contributions are as follows: (1) a uniform quantization scheme; (2) a detailed justification of the choice of which tensors to quantize; (3) a tensor bucketing method for achieving higher precision; and (4) a dedicated handling of zero values.